Measuring the similarity of charts in graphical statistics

Figures used in statistics and other sciences play a vital role in understanding and analyzing the problems under study. Due to the complexity and diversity of these problems, figures such as cartograms, choropleth maps, or radar charts take various geometric forms. Evaluating their geometric similarity visually is essential but insufficient. This paper proposes and theoretically justifies new metrics based on graph theory. They make it possible to quickly determine the degree of similarity of the statistical figures used in the research procedure. The new metrics were used to (1) determine the similarity of the domestic route networks of major U.S. airlines, (2) determine the similarity of the distributions of votes cast in each state in the U.S. presidential elections of 2016 and 2020, (3) compare radar charts of selected countries, constructed on the basis of the Global Competitiveness Index, and (4) analyze the similarity of neutrosophic double line graphs representing sets of approximate (neutrosophic) numbers. This improves analytical capabilities concerning various processes mapped with well-known types of statistical charts, such as choropleth maps and radar charts.


New metrics between graphical structures
In its simplest form, a network is a collection of points joined together in pairs by lines, a description that is sufficient here. The points are referred to as vertices and the lines as edges. Many objects of interest in the physical, biological, social, and geographical sciences can be called networks.
Several mathematical models of networks have been implemented (see ref. 17). Traditional models, such as random graphs and their extensions, mimic the patterns of connections in real networks. The fundamental paper in ref. 18 initiated essential research on random graphs and their applications, including the contribution of Erdős and Palka's papers 19,20. In contrast to the random approach, we will apply here the most basic network model, namely the simple graph introduced by Euler 21 in 1736.
A simple graph $G = (V, E)$ is a pair of two finite sets, namely a non-empty set of vertices $V$ and a set of edges $E$, which is a subset of unordered pairs of vertices from $V$. In particular, the set of edges can be empty; in that case $G$ is called a null graph. We will adopt the following labeling convention. In mathematical formulae and inequalities, and only there, the symbol $V$ stands for $|V|$, the number of vertices, and $E$ stands for $|E|$, the number of edges. This convention allows mathematical formulae to be written in a form that is easier to read and does not cause ambiguity. In graphical statistics, a question naturally arises about the distance between given graphs.
Let us consider two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$. The choice of a metric between these graphs depends on the particular problems under investigation. For example, in a paper by Baláž et al. 22, issues from organic chemistry were considered. To define the distance between graphs representing chemical structures, they used as a base concept the joint edges of the graphs under consideration, namely
$$d(G_1, G_2) = E_1 + E_2 - 2E_{(1,2)} + |V_1 - V_2|, \tag{1}$$
where $E_1$, $E_2$ are the numbers of edges of graphs $G_1$ and $G_2$, respectively, $E_{(1,2)}$ is the number of common edges in those graphs, and $|V_1 - V_2|$ is the absolute value of the difference of the numbers of vertices in those structures. This metric is useful in determining the similarity of graphs in a case when the distribution of edges is important, as in chemical structures.
In applications in geographical and other social sciences, in many cases we are dealing with graphical structures without any connections. In this case, Baláž's metric (1) is useless, since the absolute value of the difference of the numbers of vertices in such structures does not correctly characterize geographical properties in practical considerations. Furthermore, from a geographical point of view, two subgraphs of a given graph may be treated as identical, even though from the point of view of classical graph theory, those structures are topologically different. To be more precise, in our investigations, two subgraphs representing geographical structures with a common vertex set and the same number of edges will be treated as identical, so the distance between them must be zero. This is not guaranteed by the metric (1).
Consequently, a new metric, denoted by PRW, between graphs $G_1$ and $G_2$ was proposed in a paper by Palka et al. 23, in which the geographical aspect of the graph is taken into account. A fundamental property of geographic graphs is that their description considers the proper names of the elements of their structure, i.e., edges or vertices. In general, the names describing the vertices of geographic graphs are more important than the names of edges. Here, instead of the notation PRW, we will use the Greek letter $\delta$. The primary role in our metric is played by the symmetric difference of the vertex sets $V_1$ and $V_2$ and the absolute value of the difference of the numbers of edges in those structures. The symmetric difference of the sets $V_1$ and $V_2$ is denoted by $V_1 \triangle V_2$ and is visualized using a Venn diagram in Fig. 1.
The metric between graphs $G_1$ and $G_2$ is defined by Palka et al. 23 as follows:
$$\delta(G_1, G_2) = V_{1 \triangle 2} + |E_1 - E_2|, \tag{2}$$
where $V_{1 \triangle 2}$ denotes the number of vertices in the symmetric difference of sets $V_1$ and $V_2$. Note that only two graph parameters determine the value of this metric: the numbers of vertices and edges of the graphs being considered. Furthermore, it is easy to check that $\delta(G_1, G_2) = 0$ if and only if $V_1$ is the same as $V_2$ and both graphs have the same number of edges, i.e. $E_1 = E_2$, which is consistent with our discussion of the similarity of graphs representing geographical structures.
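Since the symmetric difference $V_1 \triangle V_2$ can be expressed as
$$V_1 \triangle V_2 = (V_1 \cup V_2) \setminus (V_1 \cap V_2),$$
we have
$$V_{1 \triangle 2} = V_1 + V_2 - 2V_{(1,2)},$$
where $V_{(1,2)}$ stands for the number of common vertices in those graphs. Finally, we obtain our distance in a more convenient form, namely
$$\delta(G_1, G_2) = V_1 + V_2 - 2V_{(1,2)} + |E_1 - E_2|. \tag{3}$$
The property of symmetry of $\delta$ is obvious, since $\delta(G_1, G_2) = \delta(G_2, G_1)$. We now present a formal proof that for three given graphs $G_1$, $G_2$, and $G_3$, the distance $\delta$ satisfies the triangle inequality, i.e.
$$\delta(G_1, G_3) \le \delta(G_1, G_2) + \delta(G_2, G_3).$$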
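Clearly,
$$|E_1 - E_3| \le |E_1 - E_2| + |E_2 - E_3|,$$
since $|a - b|$ is a metric on the real number line. Thus we need to show only that the following inequality holds:
$$V_{1 \triangle 3} \le V_{1 \triangle 2} + V_{2 \triangle 3}.$$
After simple modifications, we obtain the inequality
$$V_2 + V_{(1,3)} - V_{(1,2)} - V_{(2,3)} \ge 0.$$
It is easy to check that in the case when $V_1 \cap V_3$ is the empty set and $V_2$ is contained in $V_1 \cup V_3$, the left-hand side of this inequality equals zero. In general, the left-hand side counts the vertices of $V_2$ lying outside $V_1 \cup V_3$ together with the vertices of $V_1 \cap V_3$ lying outside $V_2$, so it is never negative. This completes the proof. Consequently, the proposed distance between graphs (in the form (2) or (3)) satisfies the necessary properties of a metric. In the case of null graphs, this metric will be denoted as $\gamma$ and has the following simple form:
$$\gamma(G_1, G_2) = V_1 + V_2 - 2V_{(1,2)}. \tag{4}$$
Note that if two graphs are not empty but have the same number of edges, then $\delta = \gamma$. Nevertheless, we will use the notation $\gamma$ only in the case of null graphs.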
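The vertex part of the triangle inequality can also be spot-checked numerically. The Python sketch below is an illustration only, not a substitute for the proof: the function names, the size of the vertex universe, and the random sampling scheme are assumptions introduced here. For random triples of non-empty vertex sets it checks that the number of vertices in $V_2$, plus those common to $V_1$ and $V_3$, minus those common to $V_1$ and $V_2$ and to $V_2$ and $V_3$, is a non-negative count, and that the symmetric-difference sizes satisfy the triangle inequality.

```python
import random


def vertex_slack(v1, v2, v3):
    """V_2 + V_(1,3) - V_(1,2) - V_(2,3), written with set cardinalities."""
    return len(v2) + len(v1 & v3) - len(v1 & v2) - len(v2 & v3)


def spot_check(trials=5000, universe=10, seed=0):
    """Return True if every sampled triple of vertex sets passes both checks."""
    rng = random.Random(seed)
    labels = list(range(universe))
    for _ in range(trials):
        # Three random non-empty vertex sets drawn from a small label universe.
        v1, v2, v3 = (
            set(rng.sample(labels, rng.randint(1, universe))) for _ in range(3)
        )
        slack = vertex_slack(v1, v2, v3)
        # Counting interpretation of the slack: vertices of V2 outside V1 u V3
        # plus vertices of V1 n V3 outside V2 (never negative).
        counted = len(v2 - (v1 | v3)) + len((v1 & v3) - v2)
        # Triangle inequality for the symmetric-difference part of delta.
        holds = len(v1 ^ v3) <= len(v1 ^ v2) + len(v2 ^ v3)
        if slack != counted or not holds:
            return False
    return True
```

A call such as `spot_check()` returning `True` only means that no counterexample was found among the sampled triples.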
It turns out that in practical applications, dealing with a relative value of the distance $\delta$ or $\gamma$ is more helpful than their absolute values, as in (3) and (4). Considering the possible applications of the measurement of similarities of geographical subgraphs, we propose in this paper to divide the value of $\delta$ and $\gamma$ by the number of vertices in $V_1 \cup V_2$. Consequently, the formulae for the relative distances $\delta^*$ and $\gamma^*$ of a given pair of graphs, say $G_1$ and $G_2$, are
$$\delta^*(G_1, G_2) = \frac{V_1 + V_2 - 2V_{(1,2)} + |E_1 - E_2|}{V_1 + V_2 - V_{(1,2)}} \tag{5}$$
and
$$\gamma^*(G_1, G_2) = \frac{V_1 + V_2 - 2V_{(1,2)}}{V_1 + V_2 - V_{(1,2)}}, \tag{6}$$
respectively. The value of the denominator in (5) and (6) is greater than zero, since both $V_1$ and $V_2$ are non-empty sets. As in the case of the metric $\delta$, the relative distance $\delta^*(G_1, G_2) = 0$ if and only if $V_1$ and $V_2$ are the same and $E_1 = E_2$. Furthermore, the relative distance for null graphs always satisfies the inequality $0 \le \gamma^* \le 1$.
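The four distances can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a graph is represented here only by its set of vertex names and its number of edges, and the function names and the example labels are hypothetical.

```python
def gamma(v1, v2):
    """Number of vertices in the symmetric difference of two vertex sets."""
    return len(v1 ^ v2)


def delta(v1, e1, v2, e2):
    """gamma plus the absolute difference of the edge counts e1 and e2."""
    return gamma(v1, v2) + abs(e1 - e2)


def _union_size(v1, v2):
    """Size of the vertex union; both vertex sets must be non-empty."""
    if not v1 or not v2:
        raise ValueError("vertex sets must be non-empty")
    return len(v1 | v2)


def gamma_star(v1, v2):
    """Relative distance for null graphs; always between 0 and 1."""
    return gamma(v1, v2) / _union_size(v1, v2)


def delta_star(v1, e1, v2, e2):
    """Relative distance; may exceed 1 when the edge counts differ strongly."""
    return delta(v1, e1, v2, e2) / _union_size(v1, v2)


# Hypothetical example: two graphs sharing the vertices "B" and "C".
g1_vertices, g1_edges = {"A", "B", "C"}, 2
g2_vertices, g2_edges = {"B", "C", "D"}, 2
print(delta(g1_vertices, g1_edges, g2_vertices, g2_edges))       # 2
print(delta_star(g1_vertices, g1_edges, g2_vertices, g2_edges))  # 0.5
```

Working with sets of vertex names rather than bare counts reflects the role that proper names play in geographic graphs: two vertices are common only if they carry the same name.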
Let us emphasize again that the value of the metric $\delta^*$ is determined by two parameters, the numbers of vertices and edges of the graphs under consideration, and has nothing to do with their topological structures. In Fig. 2, there are two subgraphs (black and red edges, respectively) on the same vertex set $V = \{1, 2, \ldots, 22\}$, for which the distance $\delta^*$ equals zero. This is because both subgraphs have the same number of edges, equal to 21.
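A simple transformation of formula (5) provides the following form for our distance:
$$\delta^*(G_1, G_2) = 1 - \frac{V_{(1,2)} - |E_1 - E_2|}{V_1 + V_2 - V_{(1,2)}}. \tag{7}$$
From this formula, it is easy to see that
$$\delta^*(G_1, G_2) \begin{cases} < 1 & \text{if } V_{(1,2)} > |E_1 - E_2|, \\ = 1 & \text{if } V_{(1,2)} = |E_1 - E_2|, \\ > 1 & \text{if } V_{(1,2)} < |E_1 - E_2|. \end{cases}$$
To illustrate the first case, let us consider the two graphs shown in Fig. 3. The black graph has 19 vertices and 18 edges, whereas the red graph has 16 vertices and 15 edges. Moreover, the two graphs have 13 common vertices (marked green). Consequently, the inequality $V_{(1,2)} > |E_1 - E_2|$ holds, and by (7) the distance between these graphs is $\delta^* = 1 - (13 - 3)/22 \approx 0.55$.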
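The Fig. 3 example can be reproduced from the counts alone. The sketch below is illustrative: the function names are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used here only so that the two algebraic forms of the relative distance can be compared without rounding.

```python
from fractions import Fraction


def delta_star_ratio(v1, e1, v2, e2, common):
    """Relative distance as delta divided by the size of the vertex union."""
    union = v1 + v2 - common
    return Fraction(v1 + v2 - 2 * common + abs(e1 - e2), union)


def delta_star_one_minus(v1, e1, v2, e2, common):
    """Equivalent form: one minus (common - |e1 - e2|) over the union size."""
    union = v1 + v2 - common
    return 1 - Fraction(common - abs(e1 - e2), union)


# Fig. 3: 19 vertices / 18 edges versus 16 vertices / 15 edges, 13 common.
d = delta_star_one_minus(19, 18, 16, 15, 13)
print(d, round(float(d), 2))  # 6/11 0.55

# Both forms agree, and the value is below 1 because 13 > |18 - 15|.
assert d == delta_star_ratio(19, 18, 16, 15, 13)
assert d < 1
```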

It appears that the value of the relative distance $\delta^*$ may be substantially large. Indeed, let us consider two graphs where $G_1$ is a complete graph on the vertex set $V_1$, i.e. each pair of vertices from $V_1$ is connected by an edge, and $G_2$ is a null graph having one vertex, which is also an element of $V_1$. Consequently,
$$V_2 = 1, \quad V_{(1,2)} = 1, \quad E_1 = \frac{V_1 (V_1 - 1)}{2}, \quad E_2 = 0.$$
Thus from (7) we obtain
$$\delta^*(G_1, G_2) = 1 - \frac{1 - V_1 (V_1 - 1)/2}{V_1} = \frac{(V_1 - 1)(V_1 + 2)}{2 V_1}$$
if $V_1 \ge 2$. To illustrate this case, let us consider the two graphs shown in Fig. 4. $G_1$ is a complete graph on the vertex set $\{1, 2, 3, 4\}$, while $G_2$ is a null graph on a single vertex $\{4\}$. By (7), $\delta^*(G_1, G_2) = 1 - (1 - 6)/4 = 2.25$.
Another task encountered in such fields as international economics, urban economics, socio-economic geography, sociology, etc. is comparison of the socio-economic situation of countries, cities, etc., depicted on a radar chart. A radar chart is also known as a web chart, irregular polygon, star plot polygon, or polar chart. Radar charts have a long history, having been invented by Georg von Mayr in 1877 (see Appendix 2). Figure 5 shows a radar chart of two countries. The image spanned by the values of 100 categories represents an ideal case in the sense that all factors (pillars) are taken into account; for example, some countries are developed to the maximum degree. This is a situation which in reality will probably never occur. However, the question can be posed: what is the distance between specified countries in terms of the given $n$ pillars (in the example in Fig. 5, $n = 12$)? Here, we propose to adopt a $\gamma$ metric for two radar charts, say $R_1$ and $R_2$, rather than two graphs. Instead of taking into account the number of vertices of the graphs, our metric will be based on the areas of corresponding parts of the radar charts.
Let $A(F)$ denote the area of a figure $F$. Let $R_{i,1}$ and $R_{i,2}$ denote the $i$-th parts of the given radar charts $R_1$ and $R_2$. Then
$$\gamma(R_1, R_2) = \sum_{i=1}^{n} \gamma(R_{i,1}, R_{i,2}),$$
where $n$ is the number of pillars in $R_1$ and $R_2$.
First, let us note that the metric $\gamma(R_{i,1}, R_{i,2})$ must be considered separately for each $i$-th part of the radar charts. Keeping in mind formula (4) and the assumption that the metric for radar charts is based on the area of corresponding parts, we have, for a given $i$:
$$\gamma(R_{i,1}, R_{i,2}) = A(R_{i,1}) + A(R_{i,2}) - 2A(R_{i,1} \cap R_{i,2}), \tag{9}$$
where $A(R_{i,1})$ and $A(R_{i,2})$ are the areas of $R_{i,1}$ and $R_{i,2}$, respectively, and $A(R_{i,1} \cap R_{i,2})$ is the area of their common part. Let $\triangle XYZ$ denote the triangle with vertices $X$, $Y$ and $Z$. We have to analyze two significantly different situations.
Case 1. In a given part the lines of the two tested figures do not intersect. For example, in Fig. 5, in the part between the first and second pillars, the red line does not cross the green line. This situation is simple to analyze. As shown in Fig. 6a, in this case we have two triangles, say $\triangle Q_1 O Q_2$ and $\triangle P_1 O P_2$, of which the second is properly contained in the first. Thus, by (9),
$$\gamma(R_{i,1}, R_{i,2}) = A(\triangle Q_1 O Q_2) + A(\triangle P_1 O P_2) - 2A(\triangle P_1 O P_2) = A(\triangle Q_1 O Q_2) - A(\triangle P_1 O P_2).$$
(In practical applications the number of the part of the charts, i.e. the value of $i$, will be known.)
Case 2. In a given part the lines of the two tested figures intersect. For example, in Fig. 5, in the part between the ninth and tenth pillars, the red line crosses the green line. This situation is somewhat more involved to analyze than Case 1. Nevertheless, as is shown in Fig. 6b, the common part $R_{i,1} \cap R_{i,2}$ is now a quadrilateral bounded by the two pillars and by the lower of the two lines on each side of their intersection point. Consequently, by (9), for Case 2, for this particular part of the charts, the distance equals the sum of the areas of the two triangles minus twice the area of this quadrilateral.
In the case of the metric $\gamma^*$, let us assume that we are dealing with $m$ radars $R_1, R_2, \ldots, R_m$. Let
$$\gamma_{\max} = \max_{1 \le k < l \le m} \gamma(R_k, R_l)$$
be the largest value of metric $\gamma$. Then for a given pair of radars, $R_k$ and $R_l$, say, we define the metric $\gamma^*$ as follows:
$$\gamma^*(R_k, R_l) = \frac{\gamma(R_k, R_l)}{\gamma_{\max}}.$$
In socio-economic studies and many others, there are very often situations where the available sets of numerical data are ambiguous. Then, for example, neutrosophic statistical tools can be used, including neutrosophic statistical graphs (see 24–28). Their spatial structure can be very different. Hence, assessing the mutual similarity of such figures can be difficult. The metric derived in this paper makes it easy to determine the degree of similarity between neutrosophic graphs.
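Returning to the area-based metric for radar charts, the two cases can be sketched in Python. Several assumptions are made here and are not taken from the paper: the pillars are equally spaced at an angle of $2\pi/n$, each value is plotted as a radial distance from the centre $O$, and the function names are hypothetical. The crossing point in Case 2 is obtained by solving the two line equations in coordinates along the neighbouring pillars, which is one possible way of computing the common area.

```python
import math


def sector_gamma(p1, p2, q1, q2, theta):
    """Area-based distance for one part between two neighbouring pillars.

    p1, p2 are the values of the first chart on the two pillars,
    q1, q2 those of the second chart; theta is the angle between pillars.
    """
    area_p = 0.5 * p1 * p2 * math.sin(theta)
    area_q = 0.5 * q1 * q2 * math.sin(theta)
    if (p1 - q1) * (p2 - q2) >= 0:
        # Case 1: the lines do not cross, so one triangle contains the other.
        return abs(area_p - area_q)
    # Case 2: the lines cross at S = u * e1 + v * e2, where e1 and e2 are
    # unit vectors along the two pillars (assumed construction).
    det = p2 * q1 - p1 * q2
    u = p1 * q1 * (p2 - q2) / det
    v = p2 * q2 * (q1 - p1) / det
    # Common part: quadrilateral O, lower point on pillar 1, S, lower point
    # on pillar 2 (shoelace formula in the oblique coordinates, times sin).
    common = 0.5 * (min(p1, q1) * v + min(p2, q2) * u) * math.sin(theta)
    return area_p + area_q - 2.0 * common


def radar_gamma(chart1, chart2):
    """Sum of the part-wise distances over all n parts of two radar charts."""
    n = len(chart1)
    if n != len(chart2) or n < 3:
        raise ValueError("charts need the same number (at least 3) of pillars")
    theta = 2.0 * math.pi / n
    return sum(
        sector_gamma(chart1[i], chart1[(i + 1) % n],
                     chart2[i], chart2[(i + 1) % n], theta)
        for i in range(n)
    )
```

For instance, with four pillars the charts `[2, 2, 2, 2]` and `[1, 1, 1, 1]` fall under Case 1 in every part, while `[1, 2, 1, 2]` and `[2, 1, 2, 1]` fall under Case 2 in every part.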
Based on the determination of the metric $\gamma$ for radar charts, we will now describe an idea of applying our approach to assess the "proximity" of the data represented by uncertain numbers.
In the first step we define a metric between given sets of points on the plane.
Our goal is to propose a metric between $N_1$ and $N_2$, which will be based on a metric between polygons. A crucial point in our considerations is as follows. Instead of applying metric (13) directly to the sets $N_1$ and $N_2$, we will consider a more sophisticated approach: we take into account the minimum and maximum values of the uncertain numbers and create four sets of plane points. The application of the newly defined metric $\gamma$ to neutrosophic numbers is outlined in Section "Neutrosophic double line graphs".

Applications of the new metrics

Graphs
Graphs describe spatial relations using various metrics, often understood as distance functions. They also help determine, for example, the accessibility of certain spatial points, the spatial structure of objects consisting of points and connecting lines, etc. (e.g. 29). In some scientific work, for example in the procedure of grouping the objects under study according to their structural similarity, it is necessary to determine the degree of similarity of such objects. The proposed distances $\delta$ and $\delta^*$ can be used to achieve this goal. We illustrate this by comparing the structural similarities of the domestic connection networks of three major U.S. airlines. It is virtually impossible to determine visually the similarity or dissimilarity of the connection networks of these airlines; see Fig. 9. It is, however, feasible if the $\delta$ and $\delta^*$ metrics are used.
Based on the data in Table 1, namely $E$, $V$, $V_{(1,2)}$, $V_{(1,3)}$, $V_{(2,3)}$, one can easily determine the degree of similarity between the domestic connection networks offered by these airlines. This degree of similarity is determined by the numerical values of the metrics $\delta$ and $\delta^*$. It can be concluded that in terms of structure, the connection networks of American Airlines and Delta differ the most. On the other hand, the greatest similarity is found between the network structures of Delta and United Airlines. It should be added that the numerical values of the metrics can, of course, be used in various kinds of studies and reports on the spatial optimization of airline connections, especially when new air routes are planned and the problem of competition between airlines arises. It should be noted that the metrics used here can be used to analyze the similarity of the structure of various network-like objects.

Choropleth maps
In spatial economics there is often a need to compare various spatial structures, for example, in the form of choropleth maps (see Appendix 2). Figure 10 shows three choropleth maps depicting the same region, whose seven internal spatial units are categorized into four spatial types: A, B, C, and D. (In cartography, charts in the form of choropleth maps are also known as cartograms proper, because their scale is discontinuous, i.e. discrete.) Comparative analysis requires establishing the similarity between the objects, preferably through an explicitly defined distance. Both $\gamma$ and $\gamma^*$ can be used for this purpose. It is clear that the regions 1, 2 and 3 in Fig. 10 can be considered as three null graphs with the same number of vertices, namely 7, and different numbers of common vertices. Thus, for example, $\gamma(1, 2) = 7 + 7 - 2 \cdot 3 = 8$, while $\gamma^*(1, 2) = 8/11 = 0.73$. In turn, $\gamma(1, 3) = 4$, $\gamma^*(1, 3) = 0.44$, $\gamma(2, 3) = 8$ and $\gamma^*(2, 3) = 0.73$. The result confirms the visual assessment according to which choropleth maps 1 and 3 are the most similar in terms of spatial structure.
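The arithmetic of this example can be sketched in Python. The interpretation used here is an assumption: each vertex of the null graph is taken to be a pair (spatial unit, type), so two maps share a vertex exactly when they assign the same type to the same unit. The two maps below are hypothetical and are not the maps of Fig. 10; they were chosen only so that, as for maps 1 and 2 in the text, seven units agree on three.

```python
def map_vertices(classification):
    """Turn a {unit: type} mapping into a vertex set of (unit, type) pairs."""
    return set(classification.items())


def gamma(v1, v2):
    """Number of vertices in the symmetric difference of two vertex sets."""
    return len(v1 ^ v2)


def gamma_star(v1, v2):
    """gamma divided by the number of vertices in the union of the sets."""
    return gamma(v1, v2) / len(v1 | v2)


# Hypothetical classifications of seven units into the types A-D.
map_a = {1: "A", 2: "A", 3: "B", 4: "C", 5: "D", 6: "B", 7: "C"}
map_b = {1: "A", 2: "A", 3: "B", 4: "D", 5: "A", 6: "C", 7: "B"}

va, vb = map_vertices(map_a), map_vertices(map_b)
print(len(va & vb))                  # 3 common vertices
print(gamma(va, vb))                 # 7 + 7 - 2 * 3 = 8
print(round(gamma_star(va, vb), 2))  # 8 / 11 = 0.73
```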
Our next application deals with the 2016 U.S. presidential election, in which the Democratic Party's candidate was Hillary Clinton and the Republican Party's candidate was Donald Trump. The choropleth maps in Fig. 11 illustrate the numbers of popular votes cast for both candidates. It is easy to see the great spatial variation in these figures, as quantified by the metrics $\gamma$ and $\gamma^*$ given in Table 2. Thus, it is known that the election results in individual states for the Clinton-Trump contest in 2016 were less similar to each other than for the Biden-Trump contest in 2020. The metrics $\gamma$ and $\gamma^*$ also enable an extended analysis of the results of the 2016 and 2020 presidential elections. It can be noted, for example, that when the same candidate (Donald Trump, in this case) runs in successive elections, the results obtained by him in individual states in 2020 are not a faithful copy of the results from the previous election, since $\gamma = 8$ and $\gamma^* = 0.145$ are non-zero, although both values are very small. Election analysts can derive many more conclusions based on the values summarized in Tables 2 and 3 or others that can be constructed based on the $\gamma$ and $\gamma^*$ metrics. Particularly noteworthy, therefore, is the fact that quantification of the differences that occur between analyzed images, here choropleth maps, creates the possibility of further analysis using quantitative methods, which are very important in political and geopolitical analysis, for example.
Table 1. Numbers of vertices and edges of the networks of connections of major U.S. airlines and the similarity between them expressed by distance. Source: Authors' calculation.
The results in Table 3 allow us to conclude that:
• The choropleth maps showing the results of voting in each state in 2016 and 2020 for candidate Trump are the most similar. The corresponding values are $\gamma = 8$ and $\gamma^* = 0.145$. At the same time, it can be noted that not all states in 2020 voted for candidate Trump as in 2016.
• In contrast, the largest disparity between election results is found for candidates Clinton and Trump in 2016 ($\gamma = 84$ and $\gamma^* = 0.93$). It is larger than that between candidates Trump and Biden in 2020 ($\gamma = 80$ and $\gamma^* = 0.879$). One can try to determine why.
It should be emphasized at this point that the identification of the degree of similarity between choropleth maps in numerical form creates the possibility of further in-depth numerical analysis.

Cartograms
Presidential elections in the U.S. are in fact two-tiered: the President is elected by a college of electors representing each state. Hence, in assessing the influence of individual states on the final outcome of the elections, the electoral strength characterizing each state is an important factor. It can be determined, as proposed in ref. 13, using formula (15). The results obtained for the 2016 and 2020 presidential elections are summarized in Table 4. The corresponding cartograms are shown in Fig. 12 (refs. 30,31).
The indicator (15) is highly dependent on the number of popular votes for each state, which in turn is dependent on the number of residents of the state. Thus, as can be easily seen, the highest electoral vote power is found in such sparsely populated states as Wyoming, Vermont, Alaska, District of Columbia, etc., and the lowest in Florida, North Carolina, Colorado, etc., where the number of residents is large. The $\gamma$ and $\gamma^*$ metrics help determine the degree of similarity of the cartogram constructed for 2016 to the cartogram for 2020. The numerical values of these metrics are as follows: $\gamma = 22$, $\gamma^* = 0.355$. They confirm the relatively high similarity of the two cartograms.
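The vote power indicator can be sketched as follows: for each state, the ratio of electoral votes to popular votes is divided by the mean of that ratio. The Python code below is an illustration under assumptions: the mean is taken over all states supplied to the function, the function name is hypothetical, and the two "states" and their vote counts are invented placeholders, not election data.

```python
from statistics import mean


def vote_power(electoral_votes, popular_votes):
    """Electoral-to-popular vote ratio of each state, divided by the mean ratio."""
    ratios = {
        state: electoral_votes[state] / popular_votes[state]
        for state in electoral_votes
    }
    average = mean(ratios.values())
    return {state: ratio / average for state, ratio in ratios.items()}


# Placeholder data: a small state X and a large state Y (not real figures).
electoral = {"X": 3, "Y": 30}
popular = {"X": 300_000, "Y": 9_000_000}

power = vote_power(electoral, popular)
for state, value in power.items():
    print(state, round(value, 2))
```

By construction the indicator averages to one, so values above one mark states whose electoral votes are backed by relatively few popular votes.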

Radar charts
To illustrate the proposed metrics $\gamma$ and $\gamma^*$ for establishing the geometrical similarity of radar charts, a set of nine countries with similar values of the competitiveness coefficient (GCI) was selected. These were the countries ranked from 35 to 43, with $4.5 \le \mathrm{GCI} \le 4.7$ (see WEF 2017-2018). Their radar charts are shown in Fig. 13. The complexity of this figure and the difficulty of comparing the different radar charts with each other are readily apparent. Use of the metrics $\gamma$ and $\gamma^*$ makes it easier to determine the similarity and allows further detailed comparative analysis.
Table 5 includes the above-mentioned information on the nine selected countries. The table also contains the distances between their radar charts in terms of $\gamma^*$.
It may be noted that the GCI values suggest dividing the set of countries into only three subsets, i.e. {Azerbaijan, Indonesia}, {Malta, Russian Federation, Poland, India, Lithuania, Portugal}, and {Italy}. In contrast, the cluster analysis by the Ward method using the $\gamma^*$ metric (Fig. 14) groups the countries by the similarity of their radar charts.

Neutrosophic double line graphs
Finally, according to the proposed formula (14), the distance between the given sets of uncertain numbers is $\gamma(N_1, N_2) = 5.64239$.
If we have more than two sets of uncertain numbers, using the normalized metric γ * to compare such numbers is more advantageous.After determining the metric γ for each pair of sets of uncertain numbers, we normalize it by the value of the largest of them.
As an example, let us consider two additional sets: $N_3 = \{8 + 1.5, 9 + 1.0, 2 + 1.25, 10 + 2.0, 5 + 2.25\}$ and $N_4 = \{1 + 0.75, 5 + 0.5, 2 + 1.0, 4 + 1.5, 8 + 2.25\}$. Then, we have six possible pairs (see Fig. 15), for which we calculate the $\gamma$ metrics. Proceeding as in the example above, we determine the areas of the polygons for each pair and calculate the $\gamma$ metric according to formula (14). Then, we normalize each of them by dividing its value by the largest $\gamma$. The relevant results are summarized in Table 7. In the considered example, the sets farthest from each other in the sense of our proposed metric are $N_3$ and $N_4$ ($\gamma^*(N_3, N_4) = 1$), while the closest are the sets $N_1$ and $N_2$ ($\gamma^*(N_1, N_2) = 0.3995$). This is consistent with the visual assessment of the mutual position of these sets in Fig. 15, but more accurate.
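The normalization step, dividing every pairwise $\gamma$ by the largest one, can be sketched in Python. The function name is hypothetical and the distances below are placeholders chosen for readability; they are not the values of Table 7.

```python
def normalize_by_largest(distances):
    """Divide every pairwise distance by the largest distance in the mapping."""
    largest = max(distances.values())
    if largest <= 0:
        raise ValueError("at least one distance must be positive")
    return {pair: value / largest for pair, value in distances.items()}


# Placeholder gamma values for three pairs of sets (not data from the paper).
gamma_values = {
    ("N1", "N2"): 2.0,
    ("N1", "N3"): 5.0,
    ("N3", "N4"): 8.0,
}

gamma_star_values = normalize_by_largest(gamma_values)
print(gamma_star_values[("N3", "N4")])  # 1.0, the farthest pair
print(gamma_star_values[("N1", "N2")])  # 0.25, the closest pair
```

Because the normalizing constant depends on the whole collection, adding a new set to the comparison can change every normalized value.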

Conclusions
In the field of statistics, and graphical statistics in particular, many types of chart have been developed to facilitate the understanding and depiction of the relationships occurring in time and space between the various phenomena and factors under study. Some of them are especially frequently used, such as cartograms or choropleth maps. Figures depicting the variability of a phenomenon, for example over time, show a certain degree of similarity. How can we determine this degree of similarity objectively? This work has provided an answer to that question. The metric $\delta$, constructed by the authors, and its standardized form $\delta^*$ make it possible to determine the degree of similarity of statistical figures by determining the specific distance between them. In this way, the unavoidable subjectivity associated with the visual evaluation of statistical charts is successfully eliminated, in particular when the metrics $\gamma$ and $\gamma^*$ are also used to assess similarity. This assertion has been confirmed by the empirical analyses carried out in this paper, concerning the similarity of specific graphs, radar charts, choropleth maps and neutrosophic double line graphs that provide geometric representations of studied phenomena.
Table 6. Values of $\gamma$ metric for pairs of polygons corresponding to $N_1$ and $N_2$. Source: Authors' calculation.
Also worthy of note is the simplicity of the proposed metrics, and thus the ease with which their numerical values can be calculated.
In many situations it is not necessary to use computers and often expensive software to determine these values.Therefore, we hope that they will prove useful in statistical, economic, geographical, social and other analyses.

Figure 5. Radar map of two countries. Source: Own compilation.

Figure 6. Two different cases of intersection of radar maps. Source: Own compilation.

Figure 9. The networks of domestic connections of major U.S. airlines in 2022. Source: Own compilation.

Figure 10. Choropleth maps showing a region whose internal units are classified into different types. Source: Own compilation.

Figure 11. Percentage of popular vote in each state of the USA in 2016 and 2020. Source: Own compilation.

$$\text{Vote power} = \frac{\dfrac{\text{Number of electoral votes}}{\text{Number of popular votes}}}{\operatorname{mean}\left(\dfrac{\text{Number of electoral votes}}{\text{Number of popular votes}}\right)} \tag{15}$$

Table 3. Distances between choropleth maps showing the results of the 2016 and 2020 U.S. presidential elections for candidates of the same party and candidates of different parties. Source: Own compilation.

Figure 14. Cluster analysis of nine countries by the Ward method using the $\gamma^*$ metric. Source: Own compilation.

Table 4. Electoral vote power of U.S. states in 2016 and 2020 presidential elections. Source: Authors' calculation.

Table 5. Global Competitiveness Index of each country, their ranks, and distances between radar charts. Source: Authors' calculation.